Digit Recognizer Dataset

In this article, we work with the Digit Recognizer dataset, which was provided as part of a Kaggle competition.

Data Description

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel value is an integer between 0 and 255, inclusive.

The training data set (train.csv) has 785 columns. The first column, called "label", is the digit that was drawn by the user. The remaining columns contain the pixel values of the associated image.

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, decompose x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix (indexing from zero). For example, pixel31 is the pixel in the fourth column from the left and the second row from the top, as in the ASCII diagram below.

Visually, if we omit the "pixel" prefix, the pixels make up the image like this:

    000 001 002 003 ... 026 027
    028 029 030 031 ... 054 055
    056 057 058 059 ... 082 083
     |   |   |   |  ...  |   |
    728 729 730 731 ... 754 755
    756 757 758 759 ... 782 783
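
The index arithmetic described above can be checked with a few lines of Python. This is a minimal sketch; the helper names pixel_position and pixel_index are made up for illustration and are not part of the dataset or of any library.

```python
WIDTH = 28  # images are 28 x 28 pixels, as described above


def pixel_position(x):
    """Return the zero-indexed (row, column) of the column named pixel<x>."""
    if not 0 <= x < WIDTH * WIDTH:
        raise ValueError(f"pixel index out of range: {x}")
    # x = i * 28 + j, so integer division gives i and the remainder gives j.
    return divmod(x, WIDTH)


def pixel_index(i, j):
    """Inverse mapping: zero-indexed (row, column) back to x."""
    return i * WIDTH + j


print(pixel_position(31))   # (1, 3): second row, fourth column
print(pixel_position(783))  # (27, 27): bottom-right corner
print(pixel_index(1, 3))    # 31
```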

We can divide this data into labels and their related pixel values.
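
As a rough illustration of that split, the sketch below reads a tiny made-up CSV with the same layout as train.csv ("label" first, then the pixel columns) and separates it into a list of labels and a list of pixel rows. The sample data uses only 4 pixels per image instead of 784, and the function names split_labels_and_pixels and to_grid are hypothetical helpers, not the code used in the original article.

```python
import csv
import io

# Made-up stand-in for train.csv: same column layout, but only 4 pixels
# per image (a 2 x 2 picture) so the example stays readable.
SAMPLE_CSV = """label,pixel0,pixel1,pixel2,pixel3
7,0,255,0,128
2,12,0,200,0
"""


def split_labels_and_pixels(csv_file):
    """Read the rows and return (labels, pixels) as parallel lists."""
    reader = csv.reader(csv_file)
    header = next(reader)
    if header[0] != "label":
        raise ValueError("expected the label in the first column")
    labels, pixels = [], []
    for row in reader:
        labels.append(int(row[0]))
        pixels.append([int(value) for value in row[1:]])
    return labels, pixels


def to_grid(flat, width):
    """Reshape a flat pixel list into rows of `width` pixels each."""
    return [flat[start:start + width] for start in range(0, len(flat), width)]


labels, pixels = split_labels_and_pixels(io.StringIO(SAMPLE_CSV))
print(labels)                 # [7, 2]
print(to_grid(pixels[0], 2))  # [[0, 255], [0, 128]]
```

For the real file, the same function could be called on open("train.csv", newline=""), with width=28 when reshaping an image.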

Train and Test sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
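
To make the idea of stratification concrete, here is a hand-rolled sketch that deals the samples of each class round-robin into k folds. It illustrates the principle only; it is not scikit-learn's StratifiedKFold implementation, and the function name stratified_folds and the imbalanced label list are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed=0):
    """Assign every sample index to one of `n_splits` folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class's samples round-robin so that every fold
        # receives (almost) the same number of them.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Made-up imbalanced labels: 60% class 0, 30% class 1, 10% class 2.
labels = [0] * 60 + [1] * 30 + [2] * 10
folds = stratified_folds(labels, n_splits=5)
for fold in folds:
    # Every fold keeps the 60/30/10 class proportions of the full set.
    print(sorted(Counter(labels[i] for i in fold).items()))
```

Each fold can then serve as the test set once, with the remaining folds used for training.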

Modeling: PyTorch Multinomial Logistic Regression for Multi-Class Classification

Multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems.
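
The sketch below shows the mechanics of multinomial (softmax) logistic regression in plain Python: one linear score per class, a softmax to turn scores into probabilities, and gradient descent on the cross-entropy loss. It is a dependency-free illustration on made-up 2-D toy data, not the PyTorch model from this article; in PyTorch the same model corresponds to a single torch.nn.Linear layer trained with torch.nn.CrossEntropyLoss. The learning rate, epoch count, and toy clusters are arbitrary assumptions.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """Class probabilities for one sample x (a list of features)."""
    scores = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def predict(weights, biases, x):
    """Index of the most probable class."""
    probs = predict_proba(weights, biases, x)
    return probs.index(max(probs))


def fit(X, y, n_classes, lr=0.1, epochs=200):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    n_features = len(X[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        grad_w = [[0.0] * n_features for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            probs = predict_proba(weights, biases, x)
            for k in range(n_classes):
                # d(loss)/d(score_k) = p_k - 1[k == label]
                error = probs[k] - (1.0 if k == label else 0.0)
                grad_b[k] += error
                for j in range(n_features):
                    grad_w[k][j] += error * x[j]
        for k in range(n_classes):
            biases[k] -= lr * grad_b[k] / len(X)
            for j in range(n_features):
                weights[k][j] -= lr * grad_w[k][j] / len(X)
    return weights, biases


# Made-up toy data: three well-separated 2-D clusters, one per class.
rng = random.Random(0)
centers = [(-3.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
X, y = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(20):
        X.append([cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5)])
        y.append(label)

weights, biases = fit(X, y, n_classes=3)
predictions = [predict(weights, biases, x) for x in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(f"training accuracy: {accuracy:.2f}")
```

For the digit data, each sample would have 784 features and there would be 10 classes, which is where a tensor library becomes the practical choice.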

Setting up Tensor Arrays

Modeling

Fitting the model

Model Performance

Confusion Matrix

The confusion matrix allows us to visualize the performance of a classification algorithm. Note that, because of the size of the data, we do not perform a cross-validation evaluation here, although that type of evaluation is generally preferred.
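
As a small worked example of how a confusion matrix is built and read, the sketch below counts (true, predicted) pairs for three classes and derives precision and recall for one of them. The labels and predictions are made up for illustration and are not results from this article; the layout with true classes as rows and predicted classes as columns is a convention chosen here.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples: rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix


def precision_recall(matrix, k):
    """Precision and recall of class k, read off the confusion matrix."""
    true_positives = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum
    actual_k = sum(matrix[k])                    # row sum
    precision = true_positives / predicted_k if predicted_k else 0.0
    recall = true_positives / actual_k if actual_k else 0.0
    return precision, recall


# Made-up labels and predictions for three classes.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 2]

precision, recall = precision_recall(cm, 1)
print(f"class 1: precision={precision:.2f}, recall={recall:.2f}")
# class 1: precision=0.67, recall=1.00
```

The diagonal holds the correctly classified samples, so off-diagonal entries show which digits are confused with which.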

Predictions

Now, let's take a look at the Test set from the Kaggle dataset.


References

  1. Digit Recognizer Dataset
  2. Precision and recall, Wikipedia page
  3. Cross-validation: evaluating estimator performance